Dan Ariely
Big Data is like teenage sex:
everyone talks about it
nobody really knows how to do it
everyone thinks everyone else is doing it
so everyone claims they are doing it.
Different meanings
Machine Learning
Artificial Intelligence
Statistical Learning
Applied Statistics
Historical context important
ML primarily from CS / EE
‘Engineering’ mentality
Tabular based
| | default | student | balance | income |
|---|---|---|---|---|
| 1 | No | No | 729.5265 | 44361.625 |
| 2 | No | Yes | 817.1804 | 12106.135 |
| 3 | No | No | 1073.5492 | 31767.139 |
| 4 | No | No | 529.2506 | 35704.494 |
| 5 | No | No | 785.6559 | 38463.496 |
| 6 | No | Yes | 919.5885 | 7491.559 |
| 7 | No | No | 825.5133 | 24905.227 |
| 8 | No | Yes | 808.6675 | 17600.451 |
| 9 | No | No | 1161.0579 | 37468.529 |
| 10 | No | No | 0.0000 | 29275.268 |
| 11 | No | Yes | 0.0000 | 21871.073 |
| 12 | No | Yes | 1220.5838 | 13268.562 |
| 13 | No | No | 237.0451 | 28251.695 |
| 14 | No | No | 606.7423 | 44994.556 |
| 15 | No | No | 1112.9684 | 23810.174 |
Exchangeability
Predictive accuracy
De-emphasises inference / uncertainty / explainability
Discoverability of model parameters
Example of linear models
Scaling issues
Automated ML pipelines
Software engineering
\[\begin{eqnarray*} \text{Bias} &=& \text{under-complexity error} \\ \text{Variance} &=& \text{over-complexity error} \end{eqnarray*}\]
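One way to make these two labels precise is the standard decomposition of expected squared error at a point \(x\). It assumes squared-error loss and a data model \(y = f(x) + \varepsilon\) with zero-mean noise of variance \(\sigma^2\). The symbols \(f\), \(\hat{f}\), \(\varepsilon\) and \(\sigma^2\) are not in the slides and are introduced here for illustration only; the expectation is over training sets and noise.

\[ \mathbb{E}\left[ \left( y - \hat{f}(x) \right)^2 \right] = \left( \mathbb{E}[\hat{f}(x)] - f(x) \right)^2 + \mathbb{E}\left[ \left( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] \right)^2 \right] + \sigma^2 \]

The first term is the squared bias, the second is the variance, and \(\sigma^2\) is noise that no model can remove.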
Training-test split
\(k\)-fold cross-validation
Train-validation-test split
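A minimal sketch of how \(k\)-fold splitting can work, using only the Python standard library. The function name `k_fold_indices`, the seed, and the choice of 15 rows and 5 folds are assumptions made for illustration; real projects would normally use a library routine instead.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    # Shuffle once so that fold membership does not depend on row order.
    random.Random(seed).shuffle(indices)
    # Taking every k-th index keeps fold sizes within one row of each other.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


# Hypothetical usage: 15 rows split into 5 folds.
splits = list(k_fold_indices(n_rows=15, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each row appears in exactly one test fold, so every observation is used for evaluation once and for training \(k - 1\) times.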
Labelled data
\[ \begin{eqnarray*} \text{Discrete output} &\rightarrow& \text{Classification} \\ \text{Continuous output} &\rightarrow& \text{Regression} \end{eqnarray*} \]
Assumes data follows distributional form
Linear in parameters
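"Linear in parameters" does not require a straight-line relationship with the raw input. The sketch below fits \(y = b_0 + b_1 x^2\) by ordinary least squares: the model is curved in \(x\) but linear in \(b_0\) and \(b_1\). The data and the helper name `fit_simple_ols` are made up for illustration.

```python
def fit_simple_ols(features, targets):
    """Closed-form least squares for targets = b0 + b1 * features."""
    n = len(features)
    mean_f = sum(features) / n
    mean_t = sum(targets) / n
    # Slope = covariance(feature, target) / variance(feature).
    sxy = sum((f - mean_f) * (t - mean_t) for f, t in zip(features, targets))
    sxx = sum((f - mean_f) ** 2 for f in features)
    b1 = sxy / sxx
    b0 = mean_t - b1 * mean_f
    return b0, b1


# Made-up data generated from y = 1 + 2 * x^2 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x ** 2 for x in xs]

# Transform the input, then fit a model that is linear in b0 and b1.
squared = [x ** 2 for x in xs]
b0, b1 = fit_simple_ols(squared, ys)
print(b0, b1)
```

Because the data here are noise-free, the fit recovers the generating coefficients up to floating-point error.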
Simple to understand
Highly explainable
Prone to overfitting
Ensemble of trees
Aggregate low-bias trees to reduce variance
Sample of rows, constrain splits
Self-tuning (mostly)
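A rough sketch of the row-sampling and aggregation idea, assuming a one-dimensional regression problem. Each tree is grown fully (low bias, high variance) on a bootstrap sample of rows and the predictions are averaged. The data and function names are hypothetical, and the per-split feature constraint is omitted because there is only one feature here.

```python
import random


def sum_sq(rows):
    """Sum of squared deviations of y from its mean."""
    ys = [y for _, y in rows]
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def fit_tree(rows):
    """Grow a full 1-D regression tree on (x, y) rows; return a predictor."""
    ys = [y for _, y in rows]
    mean_y = sum(ys) / len(ys)
    xs = sorted({x for x, _ in rows})
    if len(xs) < 2:
        return lambda x: mean_y  # leaf: nothing left to split on
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [r for r in rows if r[0] <= threshold]
        right = [r for r in rows if r[0] > threshold]
        sse = sum_sq(left) + sum_sq(right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left, right)
    _, threshold, left, right = best
    left_tree, right_tree = fit_tree(left), fit_tree(right)
    return lambda x: left_tree(x) if x <= threshold else right_tree(x)


def fit_bagged_trees(rows, n_trees=25, seed=0):
    """Average fully grown trees, each fitted to a bootstrap sample of rows."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample rows with replacement
        trees.append(fit_tree(sample))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)


# Made-up (x, y) rows for illustration.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1), (6.0, 6.2)]
bagged = fit_bagged_trees(rows)
print(bagged(3.5))
```

A single fully grown tree reproduces its training rows exactly; averaging many such trees smooths out that sensitivity to individual rows.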
Ensemble of trees
Aggregate low-variance trees to reduce bias
Often the best-performing approach on tabular data
Tuning more involved
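A minimal sketch of the sequential idea, assuming squared-error loss and one input feature. Each round fits a one-split tree (a "stump": low variance, high bias) to the current residuals and adds a shrunken copy to the ensemble. The names, data, and the values of `n_rounds` and `learning_rate` are illustrative assumptions; they are also examples of the tuning parameters that make this approach more involved.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (squared error)."""
    best = None
    candidates = sorted(set(xs))  # assumes at least two distinct x values
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Boosting for squared error: each stump is fitted to current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 3.9, 7.0, 7.2]
model = fit_boosted_stumps(xs, ys)
print([round(model(x), 2) for x in xs])
```

Each stump on its own is a crude model; the ensemble reduces bias by repeatedly correcting what the previous rounds got wrong.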
Geometric method
Divides ‘feature space’ into regions
Unlabelled data
Topic Modelling
Dimensionality Reduction
More prevalent than labelled data
Supervised / Unsupervised / Semi-supervised
Google Translate
IRISHMEN AND IRISHWOMEN: In the name of God and of the dead generations from which she receives her old tradition of nationhood, Ireland, through us, summons her children to her flag and strikes for her freedom.
| doc_id | token | lemma | upos | relation |
|---|---|---|---|---|
| 1 | IRISHMEN | Irishmen | NOUN | root |
| 1 | AND | and | CCONJ | cc |
| 1 | IRISHWOMEN | IRISHWOMEN | NOUN | conj |
| 1 | : | : | PUNCT | punct |
| 1 | In | in | ADP | case |
| 1 | the | the | DET | det |
| 1 | name | name | NOUN | obl |
| 1 | of | of | ADP | case |
| 1 | God | God | PROPN | nmod |
| 1 | and | and | CCONJ | cc |
| 1 | of | of | ADP | case |
| 1 | the | the | DET | det |
| 1 | dead | dead | ADJ | amod |
| 1 | generations | generation | NOUN | conj |
| 1 | from | from | ADP | case |
Unsupervised (clustering)
Topic modelling
Lots of functionality
Assign “topics” to each “document”
Words as vectors
Semantic meaning
\[ \text{King} - \text{Male} + \text{Female} \approx \text{Queen} \]
\[ \text{Paris} - \text{France} + \text{UK} \approx \text{London} \]
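The arithmetic above can be sketched with toy vectors. The two-dimensional embeddings below are hand-made for illustration (one axis loosely standing for "royalty", the other for "gender"); real word embeddings are learned from text and have hundreds of dimensions. The `analogy` helper is hypothetical.

```python
import math

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (assumption).
vectors = {
    "king":   [0.9, 0.8],
    "queen":  [0.9, -0.8],
    "male":   [0.1, 0.8],
    "female": [0.1, -0.8],
    "apple":  [-0.7, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(analogy("king", "male", "female"))  # prints "queen"
```

The input words are excluded from the candidates because the nearest vector to the result is often one of the inputs themselves.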
Thank You